Abstract


Objectives: A rapidly developing scenario like a pandemic requires the prompt production of high-quality systematic reviews, which 
can be automated using artificial intelligence (AI) techniques. We evaluated the application of AI tools in COVID-19 evidence syntheses. 

Study Design: After prospective registration of the review protocol, we automated the download of all open-access COVID-19 systematic
reviews in the COVID-19 Living Overview of Evidence database, indexed them for AI-related keywords, and located those that used AI
tools. We compared their journals’ JCR Impact Factor, citations per month, screening workloads, completion times (from pre-registration to
preprint or submission to a journal) and AMSTAR-2 methodology assessments (maximum score 13 points) with a set of publication date
matched control reviews without AI.

Results: Of the 3,999 COVID-19 reviews, 28 (0.7%, 95% CI 0.47–1.03%) made use of AI. On average, compared to controls (n = 64),
AI reviews were published in journals with higher Impact Factors (median 8.9 vs. 3.5, P < 0.001), and screened more abstracts per author
(302.2 vs. 140.3, P = 0.009) and per included study (189.0 vs. 26.9, P < 0.001) while inspecting fewer full texts per author (5.3 vs. 14.0,
P = 0.005). No differences were found in citations per month (0.5 vs. 0.6, P = 0.600), inspected full texts per included study (3.8 vs. 3.4,
P = 0.481), completion times (74.0 vs. 123.0 days, P = 0.205) or AMSTAR-2 (7.5 vs. 6.3, P = 0.119).

Conclusion: AI was an underutilized tool in COVID-19 systematic reviews. Its usage, compared to reviews without AI, was associated 
with more efficient screening of literature and higher publication impact. There is scope for the application of AI in automating systematic 
reviews. © 2022 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).
What is new?

Key findings
• The use of artificial intelligence (AI) in COVID-19 systematic reviews was very low.
• COVID-19 reviews using AI tools showed higher publication impact and workload savings.

What this adds to what was known?
• Semi-automated screening and RCT filtering are the most notable use-cases of AI tools in evidence synthesis.
• There is a lack of systematic review tools cohesively integrating AI.

What is the implication and what should change now?
• There is scope for the application of AI in automating systematic reviews going forward.


1. Introduction 


Evidence-based medicine depends on the production of
timely systematic reviews to guide and update health care
practice and policies [1]. This is a resource-intensive undertaking,
requiring teams of multiple reviewers to interrogate
numerous repositories and databases, screen through thousands
of potentially relevant citations and articles, extract
the pertinent data from the selected studies, and then prepare
cohesive summaries of the findings [2,3]. In the
context of the SARS-CoV-2/COVID-19 pandemic, methods
to speed up this lengthy process were urgently needed [4,5].

Systematic evidence synthesis relies on robust and standardized
procedures to achieve dependable results. However,
the call to accelerate research output during the
pandemic led to a decrease in reviews’ methodological
quality [6,7] and the rise of “rapid reviews” [8,9]
(which shorten the usual timeframes by sacrificing
search depth, screening robustness or data extraction,
at the expense of an increased risk of errors). Are these
unavoidable tradeoffs for timelier results?

Instead, artificial intelligence (AI) based solutions (that
automate parts of the workflow by mimicking human
problem-solving, comprising machine learning, natural language
processing, data mining and other subfields) [10] are
now available to either complement or substitute human efforts
with limited risk of bias [11–13], and have been previously
(but scarcely) [14] employed in evidence synthesis
to enhance screening [15] and data extraction [16,17]. Their
aims are to shorten production times, allow for broader
screenings of the literature and reduce reviewers’ workloads
without compromising on methodological quality.


Here, we evaluated the use of AI techniques among 
COVID-19 evidence syntheses to empirically determine 
whether, compared to COVID-19 evidence syntheses 
without AI, they impacted on the production, the quality, 
and the publication of systematic reviews. 


2. Materials and methods 


This methodological study [18] is reported following
PRISMA 2020 guidelines [19] (checklist provided as
Supplementary material 1A), and its protocol was prospectively
registered at Open Science Framework Registries (DOI
10.17605/OSF.IO/HSDAW) [20].


2.1. Search and selection of reviews 


We considered for inclusion all COVID-19 related systematic
reviews that could have made use of any AI tool (machine
learning, deep learning, or natural language
processing) to accelerate, improve or complement any aspect
of the review conduct (search, screening, data extraction and
synthesis). We implemented a script (available at DOI
10.5061/dryad.9kd51c5j6) [21] to process all COVID-19 bibliographic
references registered in the COVID-19 Living Overview
of Evidence (L-OVE) database [22], filtering articles
classified as “systematic review” between December 1st,
2019 and August 15th, 2021, and then querying the “Unpaywall”
database [23] for every extracted DOI to obtain a JSON
record with download links. The process was repeated three
times since the publication of our protocol to reduce the loss
of articles due to server-side errors (last searched on August
17th, 2021).
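
The DOI lookup described above can be sketched as follows. This is a minimal illustration, not the authors' script (which is the one deposited at the Dryad DOI given above): it assumes the public Unpaywall v2 REST endpoint and its `is_oa`, `best_oa_location`, `oa_locations` and `url_for_pdf` record fields, and the function names `fetch_record` and `pdf_links` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: public Unpaywall v2 endpoint, queried as <endpoint><doi>?email=<address>.
UNPAYWALL_API = "https://api.unpaywall.org/v2/"


def fetch_record(doi: str, email: str) -> dict:
    """Download the Unpaywall JSON record for one DOI (needs network access)."""
    url = f"{UNPAYWALL_API}{urllib.parse.quote(doi)}?email={urllib.parse.quote(email)}"
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def pdf_links(record: dict) -> list[str]:
    """Return the distinct PDF links of an open-access record, best location first."""
    if not record.get("is_oa"):
        return []  # non-open-access reviews were excluded from the study
    locations = [record.get("best_oa_location")] + list(record.get("oa_locations") or [])
    links = []
    for location in locations:
        url = (location or {}).get("url_for_pdf")
        if url and url not in links:
            links.append(url)
    return links
```

Keeping the parsing step separate from the network call makes it easy to repeat the download pass several times, as the authors did, and to retry only the DOIs whose request failed.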

To capture reviews that deployed AI, we constructed a
list of keywords with high probability of appearing in papers
with AI tools (Supplementary Material 1B). We indexed
every downloaded file with the OpenSemanticSearch search
engine, running on a local Linux virtual machine. Every file
that matched any of our keywords was manually inspected
independently by two authors (JRTH and RFL). Pre-prints
and non-English articles were included. The only exclusion
criterion applied was non-open access status, due to the need
to evaluate the methods section of each included review. To
create a comparison group of reviews without AI with sufficient
statistical power, for each included review we used
the obtained records to randomly select three controls with
the same publication date (within a 1-day margin if not
enough articles were available for a given date). In addition,
we located and included for analysis all previous versions of
reviews labeled as living or “updated”.
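
The date-matched control selection can be sketched as below. This is an illustrative reading of the procedure, not the authors' code: the function name `pick_controls`, the DOI-to-date mapping and the choice to keep all same-day candidates before widening to the 1-day margin are assumptions.

```python
import random
from datetime import date, timedelta


def pick_controls(case_date, candidates, k=3, margin_days=1, rng=None):
    """Pick k control DOIs published on case_date, widening by margin_days if needed.

    candidates maps DOI -> publication date and must already exclude AI reviews.
    """
    rng = rng or random.Random()
    same_day = sorted(doi for doi, d in candidates.items() if d == case_date)
    if len(same_day) >= k:
        return rng.sample(same_day, k)
    # Not enough same-day reviews: top up from those within the allowed margin.
    margin = timedelta(days=margin_days)
    nearby = sorted(
        doi for doi, d in candidates.items()
        if d != case_date and abs(d - case_date) <= margin
    )
    needed = min(k - len(same_day), len(nearby))
    return same_day + rng.sample(nearby, needed)
```

Passing a seeded `random.Random` instance makes the draw reproducible; a caller would also need to remove already selected controls from `candidates` before matching the next review.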


2.2. Data extraction 


The following data were manually extracted independently
by two authors (JRTH and RFL) from each review:
type of review (as described by its authors: standard, rapid/
scoping, living, or update of a prior version); disclosed
funding and conflicts of interest information; publication
status, 2020 Journal Citation Reports (JCR) Impact Factor
of the publishing journal and number of citations received
(up to August 17th, 2021); number of abstracts screened,
full texts reviewed and included studies; number of authors
and of reviewers participating in the screening; and dates of
protocol registration (if available) and of the review’s
earliest version. For living and updated reviews, we
computed the increase in records screened and included between
each of their versions and attributed their citation
count to the newest one (to avoid double counting). Excel
was used to record all variables.

Bibliographic records identified from the COVID-19 Living OVerview of Evidence (L-OVE) (n = 7050)
    Bibliographic records removed: no DOI provided (n = 795); duplicates (n = 275); not indexed in the Unpaywall database (n = 249)
Reviews assessed for availability (n = 5731)
    Reviews excluded: non-open access (n = 373); full text not available as a PDF file (n = 843)
Reviews sought for download (n = 4515)
    Reviews not retrieved: duplicates (n = 369); broken link (n = 147)
Reviews screened (n = 3999)
    Reviews not matching any search term (n = 3419)
Reviews assessed for eligibility (n = 580)
Reviews using AI (n = 20); prior versions located (n = 8)
Reviews used as controls (n = 60); prior versions located (n = 4)

Fig. 1. Flowchart of included reviews: Flowchart of records obtained, screened, assessed for eligibility, and included in our study.

Three authors (JRTH and RFL, assisted by CAP) graded
all reviews with the AMSTAR-2 quality appraisal and risk
of bias rating [24]. We excluded items 11-12 and 15, which
apply to meta-analyses (as pre-specified by our protocol),
and gave 0.5 points for “partial YES” answers when applicable,
making for a maximum score of 13 points. For living
and updated reviews, we only evaluated their most recent
version (to avoid double counting). For reviews that
included both randomized controlled trials and observational
studies, question 9 (assessment of the risk of bias
of individual studies) was graded separately for each study
type. The list of the quality items evaluated is provided as
Supplementary material 1C.
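
The scoring rule just described can be sketched as follows, under stated assumptions: the answer labels and the function name `amstar2_score` are hypothetical, and the sketch takes a single answer for question 9, since the text does not say how the separate grades for randomized and observational studies were combined.

```python
EXCLUDED_ITEMS = {11, 12, 15}  # meta-analysis items, left out as pre-specified
POINTS = {"yes": 1.0, "partial yes": 0.5, "no": 0.0}


def amstar2_score(answers):
    """Sum the points of the 13 retained AMSTAR-2 items.

    answers maps item number (1-16) to "yes", "partial yes" or "no".
    """
    retained = [item for item in range(1, 17) if item not in EXCLUDED_ITEMS]
    missing = [item for item in retained if item not in answers]
    if missing:
        raise ValueError(f"unanswered items: {missing}")
    return sum(POINTS[answers[item].lower()] for item in retained)
```

A review answering "yes" to every retained item reaches the maximum of 13 points, and each "partial YES" lowers that total by half a point.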


2.3. Data synthesis 


We calculated the ratios of abstracts screened and full texts
inspected per author (as workload measurement) and per
included study (screening precision). The number of reviewers
participating in the screening was reported inconsistently
between studies and was therefore not used in the
calculations. We calculated the completion time of the
pre-registered reviews as the difference between their protocol’s
date and the first pre-print’s date of publication (or reception
date at the journal, for published articles with no pre-prints
available). Living and updated reviews’ completion times
were calculated as the difference between the publication
dates of each of their versions. We excluded non-pre-registered
reviews from this metric due to heterogeneity in the
reporting of their starting dates. We used Pearson’s chi-square
test to compare the percentage of rapid, living, funded, and
published reviews between groups. Publishing journals’ JCR
Impact Factor, citation counts, screening workloads, completion
times and AMSTAR-2 ratings were presented as medians
with interquartile ranges (IQR), represented using box-and-whisker
diagrams and compared using the Wilcoxon-Mann-Whitney
test. R version 4.0.5 was used for statistical
computing, and GraphPad Prism 9.2.0 for graphing. We
also provided a narrative description of reviews using artificial
intelligence, detailing which parts of the review process were
automated and what software they used, how the AMSTAR-2
ratings differed among them, and how authors justified or what
impact they attributed to the use of AI tools.

Table 1. Extracted variables for artificial intelligence (AI) and control reviews: We used Pearson’s chi-square test to compare the proportions of rapid,
living, funded, and published reviews, and the Wilcoxon-Mann-Whitney test for the rest of the comparisons. Medians and IQR (Q1-Q3) are
rounded to the nearest integer

Characteristics | AI group (n = 20), n (%) | Controls (n = 60), n (%) | Δ | χ² | P-value
Rapid reviews | 5 (25%) | 6 (10%) | 15% | 2.846 | 0.092
Living reviews | 5 (25%) | 3 (5%) | 20% | 6.667 | 0.010
Received funding | 12 (60%) | 27 (45%) | 15% | 1.351 | 0.245
Published | 12 (60%) | 48 (80%) | -20% | 3.200 | 0.074

Characteristics | AI group (n = 20), median (IQR) | Controls (n = 60), median (IQR) | Wilcoxon W | P-value
Journals’ JCR Impact Factor | 9 (4-40) | 3 (3-6) | 409.0 | <0.001
Citations per month | 1 (0-13) | 1 (0-3) | 647.0 | 0.600
Abstracts screened per author | 302 (127-804) | 140 (44-378) | 1,126.0 | 0.009
Abstracts screened per included study | 189 (94-366) | 27 (14-64) | 1,443.0 | <0.001
Full texts inspected per author | 5 (4-16) | 14 (7-37) | 504.5 | 0.005
Full texts inspected per included study | 4 (2-5) | 3 (2-6) | 883.5 | 0.481
Days to completion | 74 (48-118) | 123 (53-221) | 183.5 | 0.205
AMSTAR-2 rating | 8 (5-9) | 6 (4-8) | 740.5 | 0.119

Δ, absolute differences in percentage points between AI and control reviews; χ², test statistic for Pearson’s chi-square test; Wilcoxon W, test
statistic for the Wilcoxon-Mann-Whitney rank sum test.
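
The comparison of proportions can be illustrated with a short sketch. The authors ran their tests in R; the Python function below (`pearson_chi_square_2x2`, a hypothetical name) is only an assumed equivalent that computes Pearson's chi-square without continuity correction for a 2x2 table, using the counts reported for each group.

```python
import math


def pearson_chi_square_2x2(a_yes, a_total, b_yes, b_total):
    """Pearson chi-square (no continuity correction) for two proportions.

    Returns (statistic, p_value) with one degree of freedom.
    """
    observed = [
        [a_yes, a_total - a_yes],
        [b_yes, b_total - b_yes],
    ]
    grand_total = a_total + b_total
    row_totals = [a_total, b_total]
    column_totals = [a_yes + b_yes, grand_total - a_yes - b_yes]
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * column_totals[j] / grand_total
            statistic += (observed[i][j] - expected) ** 2 / expected
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Living reviews: 5 of 20 AI reviews vs. 3 of 60 controls.
living_statistic, living_p = pearson_chi_square_2x2(5, 20, 3, 60)
```

With the reported counts this gives a statistic of about 6.667 for living reviews and about 2.846 for rapid reviews (5/20 vs. 6/60), which suggests the published test statistics were computed without continuity correction.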


3. Results 
3.1. Search and selection of reviews 


As outlined in Figure 1, we identified 7,050 bibliographic
records of COVID-19 systematic reviews, successfully
downloaded 3,999, and manually inspected 580 that
matched some of our keywords. We selected 20 reviews,
of which there were 8 prior versions, making a total of
28 reviews (0.7% of the total, 95% CI 0.47–1.03%) with
use of AI. Of the 60 articles selected as publication-date-matched
controls, we located another 4 prior versions, making
a total of 64 articles without use of AI. The complete
list of selected articles is provided as an Excel document
(Supplementary Material 2, sheet “Included reviews”) with
all the extracted variables and the AMSTAR-2 quality appraisal’s
breakdown for each question. The full list of
manually inspected and finally discarded articles is also
provided (sheet “Excluded reviews”).


3.2. Description of the included reviews 


Extracted variables are summarized in Table 1 and can
be visualized in Figure 2. Of the 20 reviews selected for using
AI, there were five rapid reviews (25%, with one
scoping review and one rapid evidence map) and five living
reviews (25%). Fifteen reviews provided a conflicts of interest
statement, of which 12 (60%) declared having
received external funding; 12 (60%) were published. Of
the 60 control reviews, there were 6 rapid reviews (10%,
with one scoping review) and three living reviews (5%).
Fifty-seven reviews provided a conflicts of interest statement,
of which 27 (45%) declared having received external
funding; 48 (80%) were published. JCR Impact Factors and
citation counts showed high variability in the AI group,
mainly due to the inclusion of three BMJ [25–27], two Cochrane
[28,29] and one Lancet [30] reviews. Furthermore,
only 10 reviews in the AI group (50%) and 22 in the controls
(36%) pre-registered a protocol, making for a total of
44 data points for the completion times’ calculation.


3.3. Comparison of AI reviews with controls 


The AI group included a higher proportion of living reviews
than the controls (5/20 vs. 3/60, 95% CI of the absolute
difference 0.2–39.8%, P = 0.010), while showing no significant
differences in rapid reviews (5/20 vs. 6/60, 95% CI -5.4 to 35.4%,
P = 0.092), funding (12/20 vs. 27/60, 95% CI -9.9 to 39.9%,
P = 0.245) or publication status (12/20 vs. 48/60, 95% CI
-43.7 to 3.7%, P = 0.074). JCR impact factors among published
reviews in the AI group were significantly higher than in
the controls (median [IQR]: 8.9 [3.9–39.9] vs. 3.5 [2.6–5.5],
P < 0.001); citation counts showed no differences (0.5
[0.0–13.5] vs. 0.6 [0.0–2.8], P = 0.600).

Fig. 2. Characteristics of the included reviews: Box-and-whisker diagram (the boxes enclose the Q1-Q3 quartiles, their middle lines represent the
median, and whiskers extend to the furthest data points within 1.5 IQR). Panel A compares the proportion of rapid, living, funded, and published
reviews between groups; Panel B presents the journals’ 2020 JCR Impact Factors and citation counts of each group; Panels C and D show authors’
workload measurements: abstracts screened and full-texts inspected, per author and per included study; Panel E exhibits the average times to
completion (in days) of the reviews in each group; and Panel F represents their measured AMSTAR-2 ratings.

Concerning the workload measurements, the AI group
screened more abstracts per author (302.2 [126.7–804.3]
vs. 140.3 [43.8–378.2], P = 0.009) and per included study
(189.0 [94.1–365.8] vs. 26.9 [13.7–64.1], P < 0.001),
while inspecting fewer full texts per author (5.3 [3.7–16.1]
vs. 14.0 [6.5–37.2], P = 0.005) and as many per included
study (3.8 [2.4–5.3] vs. 3.4 [2.0–6.2], P = 0.481).

We observed no differences in the pre-registered reviews’
times to completion (74.0 [47.5–117.5] vs. 123.0
[53.0–221.0] days, P = 0.205). The median scores obtained
in the AMSTAR-2 risk of bias rating were not significantly
higher in the AI group (7.5 [5.3–9.1] vs. 6.3 [3.9–8.0]
points out of 13, P = 0.119), with both groups showing
high heterogeneity of results as shown in Figure 3.
Measured against the controls, the AI reviews scored worse
on question 4 (literature search strategy, -12%) and better
on question 6 (data extraction in duplicate, 35%), while
showing minimal differences on question 5 (duplicate
screening, 7%). Both groups scored the lowest on questions
7 (providing a list of excluded studies) and 10 (reporting on
the sources of funding of the included studies).

Fig. 3. AMSTAR-2 methodology appraisals’ summary: The graph on the top shows the average ratings obtained in each of the evaluated questions
by the reviews using Artificial Intelligence (AI) techniques (blue bars) and by the control group (orange bars). The colored bars on the bottom provide
a visual representation of the quality appraisal’s heterogeneity in both groups (a gradient was used to represent the obtained scores: red = 4;
yellow = 6.5; green = 9).


3.4. Narrative description of the uses of AI in the included reviews


According to the step of the review process where AI
was used, we can classify the 20 reviews in the AI group
into three categories, as shown in Table 2.


3.4.1. Search process 

Three reviews [31–33] complemented their search procedures
with open-ended question queries on CORD-19
[45], an open dataset of COVID-19 related articles structured
to facilitate the use of text mining and machine
learning systems: Zaki et al. [32] used a GitHub repository
based on the Okapi BM25 search algorithm; Zaki et al. [33]
employed BioBERT, a peer-reviewed [46] and open-source
text mining system pre-trained for biomedical content analysis;
and Parasa et al. [31] provided no details on the search
engine employed. Additionally, Michelson et al. [34] used
proprietary software from the “GenesisAI” company to
produce a “rapid meta-analysis” as proof-of-concept of
their product. Daley et al. [35] disclosed no information
on the software employed. Only two reviews in this subgroup
were published, and none registered a protocol.
The average AMSTAR-2 score was 3.7/13.


3.4.2. Filtering of randomized controlled trials 

Seven articles [25,26,36–40] employed RobotSearch, a
peer-reviewed [47] and open-source software to identify
randomized controlled trials (RCT) from a citations list.
It is based on a neural network trained with data from Cochrane’s
reviews and stands out for its ease of use (no
installation is required) and flexibility (as it allows for
different levels of sensitivity, including one developed specifically
for systematic reviews, as well as integration with
other scripts).

In our sample, RobotSearch was often incorporated in
the workflows of living or partially automated reviews.
Two of the reviews that made use of RobotSearch were
Bartoszko et al. [25], a network meta-analysis of the evidence
for COVID-19 prophylaxis, and Siemieniuk et al.
[26], a living meta-analysis of randomized trials to inform
World Health Organization (WHO) Living Guidelines on
drugs for treatment of COVID-19, of which Izcovich et al.
[38] and Zeraatkar et al. [39] are separate sub-studies. Both
are part of the “BMJ Rapid Recommendations” project and
maintain a website where summaries of the evidence available
and interim analyses are published. The average
AMSTAR-2 score was 7.5/13.

Table 2. AI tools used in COVID-19 reviews: Table showing the different artificial intelligence (AI) tools that have been used in the elaboration of
COVID-19 systematic reviews, according to their area of application: search assistance, randomized controlled trials (RCT) filtering and
screening automation

Ref. | Title | Authors | Journal | AI used in... | Software used | Is open source?
[31] | Prevalence of Gastrointestinal Symptoms and Fecal Viral Shedding in Patients with Coronavirus Disease 2019 | Parasa et al. | JAMA Network Open | Search | CORD-19 | Partially
[32] | The influence of comorbidity on the severity of COVID-19 disease: systematic review and analysis | Zaki et al. | Pre-print | Search | CORD-19 + Okapi BM25 | Yes
[33] | The Estimations of the COVID-19 Incubation Period: A Scoping Reviews of the Literature | Zaki et al. | Journal of Infection and Public Health | Search | CORD-19 + BioBERT | Yes
[34] | Ocular toxicity and Hydroxychloroquine: A Rapid Meta-Analysis | Michelson et al. | Pre-print | Search | GenesisAI (formerly Evid Science) | No
[35] | A Systematic Review of the Incubation Period of SARS-CoV-2: The Effects of Age, Biological Sex, and Location on Incubation Period | Daley et al. | Pre-print | Search | Not reported | No
[36] | Impact of remdesivir on 28 day mortality in hospitalized patients with COVID-19: February 2021 Meta-analysis | Robinson et al. | Pre-print | RCT filtering | RobotSearch | Yes
[37] | Impact of systemic corticosteroids on hospitalized patients with COVID-19: January 2021 Meta-analysis of randomized controlled trials | Robinson et al. | Pre-print | RCT filtering | RobotSearch | Yes
[25] | Prophylaxis against COVID-19: living systematic review and network meta-analysis | Bartoszko et al. | BMJ | RCT filtering | RobotSearch | Yes
[26] | Drug treatments for COVID-19: living systematic review and network meta-analysis | Siemieniuk et al. | BMJ | RCT filtering | RobotSearch | Yes
[38] | Adverse effects of remdesivir, hydroxychloroquine, and lopinavir/ritonavir when used for COVID-19: systematic review and meta-analysis of randomized trials | Izcovich et al. | Pre-print | RCT filtering | RobotSearch | Yes
[39] | Tocilizumab and sarilumab alone or in combination with corticosteroids for COVID-19: A systematic review and network meta-analysis | Zeraatkar et al. | Pre-print | RCT filtering | RobotSearch | Yes

Table 2. Continued

Ref. | Title
[40] | Clinical trials in COVID-19 management & prevention: A meta-epidemiological study examining methodological quality
[41] | Impacts of school closures on physical and mental health of children and young people: a systematic review
[27] | Prediction models for diagnosis and prognosis of COVID-19: systematic review and critical appraisal
[28] | Rapid, point-of-care antigen and molecular-based tests for diagnosis of SARS-CoV-2 infection (Review)
[29] | Signs and symptoms to determine if a patient presenting in primary
[42] |
[43] |
[44] |
[30] |


3.4.3. Screening of titles and abstracts 

We found eight articles [27–30,41–44] that made use of
AI-powered screening procedures. Five of them
[27–29,41,42] used EPPI-Reviewer, a web-based tool
(distributed as shareware) to assist in the elaboration of
all kinds of literature reviews. It offers a wide variety of
features, from bibliographic management to collaborative
working, as well as study identification capabilities, automatic
clustering of articles, and text mining. In particular,
the included reviews used its “SGCClassifier” module to
prioritize the screening of articles more likely to be
included. As a result, Wynants et al. [27] and two Cochrane
reviews [28,29] quoted an 80% reduction in the
screening burden due to this tool.

Similar screening automation techniques from systematic
review production platforms were used by two other
articles: SWIFT-Active Screener [48] by Elmore
et al. [43], which was set to achieve a certain study recall
objective as the screening’s stopping criterion; and Evidence
Prime by Chu et al. [30], to double-check the
screening process. Finally, Alkofide et al. [44] used Abstrackr,
the only open-source software in this category,
which uses feedback from previously selected and rejected
articles to guide the screening process. Evaluations of this
tool published in the literature [49] suggest high workload
savings in the production of systematic reviews at the cost
of 0.1% false negative rates.

Among the reviews analyzed in this study, this subgroup presented
the highest scores in the AMSTAR-2 appraisal tool (9.1/13),
with the notable mentions of two Cochrane reviews [28,29]
(12 points) and a rapid meta-analysis [30] published in the Lancet
(10.5 points). Contrary to reviews in the other categories, which
prioritized search depth, the use of AI-powered tools in this subgroup
was motivated by the screening burden faced by the reviewers:
quoting Dinnes et al. [28], “a more efficient
approach [was needed] to keep up with the rapidly increasing
volume of COVID-19 literature”.


4. Discussion 


We evaluated whether the potential benefits of deploying AI in
evidence syntheses have been realized in COVID-19 reviews.
We found that AI was rarely utilized, appearing
in only 0.7% of the studied reviews, but that it was significantly
associated with reductions in authors’ screening
workload and publication in journals with higher Impact
Factor. Being a living review was associated with using
AI, with the most common use cases being the
optimization of screening (prioritizing studies with high
likelihood of being relevant) and the selection of randomized
controlled trials.

As a limitation of our study, we would highlight its low
statistical power due to the small number of reviews using
AI. Anticipating the limited availability of reviews with AI,
we adopted a highly sensitive screening procedure, processing
more than 7,000 bibliographic references of COVID-19
systematic reviews (combining expert advice in the selection
of keywords and a fully-featured search engine), and chose a
3:1 control group size to minimize the risk of type II statistical
errors. Using L-OVE as our primary database allowed access
to all relevant and updated sources in a systematic and
machine-readable way; however, our search strategy might
show a reduced sensitivity to institutional reports and whitepapers,
often not indexed by traditional databases. The
impact of download errors and excluding non-open-access
reviews from our study is uncertain; its influence on generalizing
our results should be interpreted in light of the diversity
of secondary sources reachable through L-OVE and the high
accessibility of COVID-19 research during the pandemic.
Furthermore, the use of publication dates as a matching variable
allowed for a bias-minimizing (script-driven) selection
of controls but it prevented the use of other desirable controlling
variables such as review sizes or goals.

We also note that reporting workloads “per author”
instead of “per reviewer participating in the screening”
may underestimate workload measurements for large teams
(when not all their authors participate in the screening). A
higher author count might also be related to resource availability,
and thus access to expert advice regarding AI. Likewise,
better-resourced groups with AI expert support might
have greater access to well-indexed journals, potentially
biasing Impact Factor analyses in favor of AI. The
AMSTAR-2 tool was inevitably applied without blinding
the reviewers to use or non-use of AI, which, given the subjectiveness
of certain aspects of the methodology assessment,
might have influenced this evaluation. Finally, the
use of citation counts to measure reviews’ impact has known
deficiencies such as being influenced by citation bias or the
authority of the authors [50], and this approach may underestimate
the impact of recently published reports.

On average, it takes 15 months for teams of five reviewers to
complete a traditional systematic review [51], with estimated
screening error rates of around 10% [52]. Facing the COVID-19
pandemic demanded robust evidence summaries with urgency,
as delays incurred costs in terms of lost lives and economic
damage. However, despite the explosive growth that the AI and
machine learning fields have experienced in recent years,
they played a surprisingly limited role in COVID-19 evidence
synthesis. Our findings are consistent with previous reports
[14] that the benefits AI can provide in the conduct of systematic
reviews are unknown to most review authors, while the relative
unorthodoxy of its methods might initially hinder their
acceptance by the research community. Open-source software,
more prone to community adoption, will be essential in this
aspect. Hopefully, our article will raise the profile of AI in evidence
syntheses.

Our narrative description of the reviews included in this
study showed that none made use of more than one AI tool.
A more cohesive approach, seamlessly merging AI into
every step of the review process, would save reviewers the
time spent interconnecting different tools with sometimes
incompatible formats. Semi-automated screening procedures
were the area where AI showed the most adoption
and the widest variety of software options (such as EPPI-Reviewer,
already adopted as a Cochrane Review Production Tool).
In contrast, full automation was only employed
by RobotSearch (an extensively appraised randomized
trials identifier), suggesting that the adoption of
increasingly automated solutions may be hindered by the
need to further assess their potential cost on recall and
risk-of-bias against their productivity contributions.


5. Conclusion 


The need for automated solutions in research synthesis is
obvious, as reviewers’ workload is growing with the rapidly
expanding biomedical field. Adoption of new technologies
can take time, but realizing AI’s potential in evidence synthesis
should be a priority. Going forward, AI must be
incorporated into systematic reviews as the next step toward
timely, better, and more responsive decision-making.